Digital Preservation and Permanent Access: The UVC for Images
Abstract
Since early 2003, the Koninklijke Bibliotheek (KB) has maintained a deposit service called the e-Depot, based on the IBM product DIAS (Digital Information and Archiving System). Through this service, the KB has developed a workflow for archiving electronic publications and has implemented the other parts of the infrastructure in which the deposit system is embedded. Now that the infrastructure is in place and the service is operational, new plans and projects have been started to extend the e-Depot technically and functionally. These include the development and implementation of a Preservation Manager, tools for permanent access to digital objects, and a new project that addresses massive storage and preservation of TIFF images delivered by museums and other cultural institutions. Thus, the e-Depot provides a long-term solution not only for born-digital material (like e-journals) but for digitized objects as well. In this paper we focus on the development and practical use of one of the permanent access tools: the Universal Virtual Computer (UVC) for images. Developed in collaboration with IBM as part of a new Preservation Subsystem for the e-Depot, the UVC may prove its value for the future rendering of images such as JPEGs or TIFFs. We explain how and why the UVC can be implemented in an operational digital archiving environment to provide permanent access; we do not elaborate on all technical details. For a fully technical description of this new approach, we refer to several articles by Raymond Lorie, the initiator of the UVC concept [5, 6, 7].

Copyright IS&T. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than IS&T must be honored. Abstracting with credit is permitted.
To copy otherwise, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.

The KB e-Depot

In 1999, the KB specified the system requirements for a full-scale deposit system, based on the ISO standard for digital archives: the Open Archival Information System [1]. As a result of a European tender procedure in 2000, the KB contracted the development of the deposit system to IBM The Netherlands. In October 2002 IBM delivered the system to the KB, built as much as possible from off-the-shelf components such as Tivoli Storage Manager, WebSphere, and Content Manager, and branded it Digital Information Archiving System (DIAS) [3]. Using DIAS, the KB maintains the service called the e-Depot [2, 9]. During the last two years Elsevier Science, Kluwer Academic Press, and BioMed Central have signed unique agreements with the KB on long-term digital archiving of their electronic publications. The digital publications of these publishers are currently being loaded into the e-Depot, involving more than 2,500 journals containing over 6 million articles. Two types of electronic publications are currently processed: offline media, like CD-ROMs that are completely installed before they are loaded into the e-Depot (including the operating system and additionally needed software), and online media, like the high volume of electronic articles deposited by the publishers [8, 9]. Ingest of installables is a time-consuming process. First, the CD-ROM is completely installed on a Reference Workstation (RWS), including all additionally needed software, such as image viewers or media players. A snapshot of the installed CD-ROM, together with the operating system on which it was installed, is then captured as a disk image. After manual cataloguing, this disk image is loaded into the e-Depot.
If an end user wants to view a particular CD-ROM, the entire disk image is retrieved from the e-Depot and installed on a dedicated RWS. By including the operating system in the stored package, the CD-ROM is guaranteed to work, even under future circumstances with new operating systems, as long as compatible hardware is operational. The second type of publication currently processed is online media. These publications are either sent to the KB on tape or captured by means of FTP. In both cases, publications ready for ingest end up in an electronic post office, in which they are validated. At this stage the content of the submission is checked for well-formedness, based on specifications agreed upon earlier. If the material does not match its checksum (or if other errors occur), the content is passed to a database for error recovery (BER). If the content is valid, content and metadata are put together as Publisher Submission Packages (PSPs), and these PSPs are then processed by the Batch Builder. The Batch Builder itself consists of a series of applications, such as TSM, Content Manager, and WebSphere. See Figure 1 for a complete overview of the data flow.

Figure 1: General e-Depot Data Flow.

The Batch Builder ingests both the content and the metadata and converts the publisher's bibliographical descriptions to the KB's internal format, at the same time adding a National Bibliographic Number (NBN). The NBN functions as the unique identifier of every stored digital item. The content itself is stored in the e-Depot, while the metadata is copied and stored in the KB catalogue. End users may query the online catalogue and retrieve the full text of the publications; where the publisher has imposed restrictions, this is possible only after a process of identification, authentication, and authorization (IAA). The e-Depot itself cannot be accessed directly, but passes relevant publications to the end user after verification.
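The validation step in the electronic post office can be sketched as follows. This is a minimal illustration, not the KB's implementation: the choice of MD5 is an assumption (the paper does not name the checksum algorithm), and the routing labels are illustrative.

```python
import hashlib

def validate_submission(payload: bytes, expected_md5: str) -> bool:
    """Check a delivered file against the checksum agreed with the
    publisher beforehand (MD5 is an assumption for illustration)."""
    return hashlib.md5(payload).hexdigest() == expected_md5

def route(payload: bytes, expected_md5: str) -> str:
    # Valid content moves on to Publisher Submission Package (PSP)
    # building; mismatches go to the error-recovery database (BER).
    return "PSP" if validate_submission(payload, expected_md5) else "BER"
```

A tampered or corrupted delivery fails the comparison and is routed to error recovery rather than to the Batch Builder.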
Recently a new project has started to explore the possibilities of storing digitized images in the e-Depot. The system already accepts submissions in TIFF format, but a new workflow and external connections have to be designed. The TIFF masters belonging to Dutch cultural heritage institutions are stored in the e-Depot, but these images should be managed by the institutions themselves. The institutions should provide the metadata and remain responsible for the selection process and access policy. They should also be able to retrieve the stored images at any time, so a remote (but restricted and secure) connection is required, capable of transferring very large data files.

Preservation Management: Signaling

By building and organizing the digital archiving infrastructure, the KB has ensured the safekeeping of deposited digital objects. To ensure future accessibility as well, two issues are of major importance: first, a standardized description of the technical properties of stored digital documents (file formats); and second, the development of tools for permanent access. The first extension of the system in this respect focused on preservation management, because the problem of digital obsolescence has to be carefully administered before it can be tackled. Therefore, IBM and the KB developed the Preservation Manager, adding preservation management functionality to the e-Depot. The Preservation Manager is a module that monitors the technical environment associated with the digital objects stored in the e-Depot. It signals specific technology changes and defines their consequences. The Preservation Manager administers all the information needed (such as technical metadata of the file format and information about the operating system, hardware, etc.) to render a digital object, in current and future environments.
In order to describe the subsequent parts of the current IT environment needed to render an object, the Preservation Manager makes use of View Paths. View Paths are instances of an abstract model called the Preservation Layer Model (PLM); cf. Figure 2.

Figure 2: The Preservation Layer Model.

This structure of PLMs and View Paths is a way to store technical metadata at the file-format level. Describing the technical properties of a digital object, together with keeping a record of the history of the object (provenance), is of major importance for the long-term storage and future rendering of digital documents. Every level in the PLM can be instantiated, thus generating a specific View Path that specifies which software and hardware are necessary to render the digital item. A PDF file can be rendered in numerous ways: for instance, by using Acrobat Reader running on an Intel machine with Windows 95 as the operating system. The same PDF can also be viewed on an IBM RS/6000 running AIX 4.2 with an AIX PDF viewer. Both are examples of specific View Paths, and by administering the valid View Paths, we get an overview of all possible ways to render a specific file type. After defining View Paths, the Preservation Manager helps us to monitor the consequences of technology changes. If Windows 95 appears to become obsolete, the View Paths that depend on Windows 95 can be automatically determined and marked as obsolete. If, for a specific format, too many View Paths are endangered due to obsolescence of any software or hardware, action has to be taken to safeguard the documents in that format.

Figure 3: Specific View Paths become obsolete.

Preservation Action: the UVC approach

The choice of any particular preservation strategy depends on contemporary technical possibilities, but mostly on the goal of the preservation and the type of digital object in question. What functionality and what type of information do we want to offer future generations?
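The View Path idea, and the way the Preservation Manager flags obsolescence, can be sketched in a few lines. This is a hypothetical data model for illustration only; the two registered paths are the PDF examples from the text, and the field names are assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ViewPath:
    """One instance of the Preservation Layer Model: the chain of
    components needed to render a given file format."""
    file_format: str
    viewer: str
    operating_system: str
    hardware: str

# Illustrative registry of valid View Paths for PDF (examples from the text).
view_paths = [
    ViewPath("PDF", "Acrobat Reader", "Windows 95", "Intel"),
    ViewPath("PDF", "AIX PDF viewer", "AIX 4.2", "IBM RS/6000"),
]

def mark_obsolete(paths, component):
    """Return the View Paths endangered when one layer (e.g. an OS)
    becomes obsolete, as the Preservation Manager would flag them."""
    return [p for p in paths
            if component in (p.viewer, p.operating_system, p.hardware)]
```

If Windows 95 is declared obsolete, only the first path is flagged; the AIX path remains valid, so PDF as a format is not yet endangered.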
What is it exactly that we want to render in the future? The choice lies between keeping the original digital object accessible, including all functionality, on the one hand, and offering a derivative of the object, for instance limiting future access to mere readability, on the other. On top of this, there is even the possibility that we might want to provide new, as yet unknown functionality. The KB has chosen to provide access to stored digital publications in the way they were deposited by the publishers. In other words: we want to keep the authentic publications accessible, or representations as close to the original as possible. The consequence of this decision is that the number of possible strategies is limited. Migration and data conversion change the authentic digital object. These strategies would therefore not be our first choice, but may have to be considered. Conversion will probably have to be applied as part of an intermediate solution, as we will explain later, but since our aim is to present the original publication, emulation would be the main strategy to use. A side effect of developing emulation-based tools is that they offer a 'backwards' solution: digital objects can be rendered in the future even if they have been left unattended for a long time. However, emulation has never been operationalised in a digital preservation environment. Apart from its experimental nature, emulation in itself is not a single strategy, but can be defined and executed in many different forms, with many possible intermediate solutions. The Universal Virtual Computer (UVC) is such an intermediate solution [6, 7]. It offers emulation in the sense that it aims at resembling the original. It is also conversion in the sense that the original file is translated into an easy-to-understand Logical Data View (LDV).
This UVC data preservation approach may not be able to keep all behavior and functionality, but it does secure content and layout. The UVC is a virtual layer that is general enough to be applied to any conceivable computer architecture, and it is described in a way that facilitates the maintenance of computer programs through time. The UVC description can be used to develop a UVC emulator (also called an interpreter) for any given platform. This future interpretation is based on an archived UVC description written in plain text [5]. The UVC architecture is based on concepts that have existed since the beginning of the computer era. Because it is virtual, it can be described in a simple and logical manner [7]. It is expected that the description of the UVC will be straightforward enough to enable future programmers to build a UVC emulator. To preserve this crucial piece of information, the description should be 'written in stone' or, more practically, stored digitally, on paper, and on microfilm as well. Interpretation through the UVC emulator allows programs written for the UVC to run. This way, these programs do not rely on any specific installation, time, or environment, thus securing digital longevity. How can programs running on the UVC help to render digital publications that are stored today? By using such a program to translate an original data file to a so-called Logical Data View (LDV), rendering of these publications on future platforms is facilitated. Such a program is called a Decoder; it has to be developed in the present and is stored along with the original publication. To test the Decoder, a UVC emulator is built for a current platform, so the LDV can be generated and tested extensively. This way the quality of the output is secured. The Decoder can be executed in the future and will behave exactly as it was tested, because it will again run on a UVC emulator.
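The Decoder/emulator relationship described above can be sketched as follows. All names here are illustrative assumptions: a real UVC emulator interprets archived UVC machine code, not Python callables, and a real Decoder parses the file format bit by bit.

```python
# Hypothetical sketch: an emulator is built from the archived plain-text
# UVC description, and any Decoder program runs identically on any
# emulator built from that same description.

def build_emulator(uvc_spec: str):
    """A programmer (today or in the future) builds a UVC emulator from
    the archived UVC description; here it is just a callable that
    executes a 'Decoder program' against an original file."""
    def run(decoder, original: bytes):
        return decoder(original)
    return run

def jpeg_decoder(original: bytes) -> dict:
    """Stand-in Decoder: translates the original file into a simple
    Logical Data View (LDV), whose structure the archived Logical
    Data Schema would explain."""
    return {"format": "LDV", "payload": original}

# Because the same Decoder runs on a UVC emulator both when tested
# today and when executed in the future, its output is identical:
emulator_today = build_emulator("archived UVC description")
emulator_future = build_emulator("archived UVC description")
assert emulator_today(jpeg_decoder, b"img") == emulator_future(jpeg_decoder, b"img")
```

The point of the sketch is the invariant in the final assertion: the Decoder's behavior is pinned to the UVC description, not to any particular platform.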
The LDV can be understood at any time in the future, because it is based on basic principles. In order to render the LDV in the future in a way that looks like the original digital object, a viewer has to be built that runs on the future platform. Future programmers should be able to build this type of viewer because of the simple structure of the LDV, and because a Logical Data Schema (LDS), which explains the structure of the LDV, will be archived with the Decoder. To sum up, at the archiving stage we need three components, apart from the original document:

1. The specification to build a UVC emulator
2. A Logical Data Schema (LDS) to interpret the logical data
3. A UVC Decoder

Figure 4: UVC data flow during archiving.

The LDV will be generated by executing the Decoder at the delivery stage. In order to run this program, a UVC emulator first has to be built for the then-current environment. After the LDV is generated, a viewer can be developed that renders the LDV as described in the Logical Data Schema. Because the Decoder will work exactly as tested at the archiving stage, the correct delivery of the LDV is guaranteed.

Practical implementation of UVC for images

In order to develop an operational UVC, the KB/IBM team has chosen to start with a UVC for images. It is being developed first for the JPEG format; in a second phase this can easily be extended to the TIFF format. Image formats have a relatively straightforward structure, which makes it possible to develop a UVC Decoder within a limited time frame and for practical purposes. However, the KB e-Depot stores electronic publications, almost all of them in PDF. Developing the UVC for PDF, building on the 2002 Proof of Concept performed by Raymond Lorie for the KB, would therefore be preferred; however, it would require considerable time and effort to be fully operational any time in the near future [5].
That is why we have chosen to start with building a UVC for JPEG, and have included a procedure for converting stored PDFs to JPEG.

Figure 5: UVC at delivery stage.

This solution is not restricted to PDF, but can be applied to all kinds of static files: they can all be converted into images. For instance, to make this UVC applicable to Word documents, Word files can also be converted into JPEG. To apply the UVC to TIFF, conversion to JPEG will not be necessary; only a small adjustment of the LDS and the Decoder is required to extend the solution to TIFF. So, by choosing to develop a permanent access solution for images, we not only guarantee the future renderability of stored digitized material, but we also take a first step towards a rendering strategy for all fixed-format digital objects.

Lining up with the e-Depot

Currently, DIAS and the e-Depot do not contain any preservation support other than the definition and management of a standard disk image for the Reference Workstations. This guarantees an environment that is able to render all the format types that are currently ingested. As explained earlier, we developed the Preservation Manager, which administers technical metadata of stored file formats. This module will be integrated into DIAS and the e-Depot shortly. The UVC for images will also be operational within the next few months, although it will serve as an experimental procedure for the time being. To provide the link between the stored publications in PDF and the UVC for JPEG, the Preservation Processor will be built. This module converts archived PDF articles to JPEG and re-ingests the converted Archival Information Package into the e-Depot. The Preservation Processor is the link between the permanent access tools and DIAS.
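The Preservation Processor's conversion-and-reingest step can be sketched as below. This is a hypothetical outline: the package fields are assumptions, and the `convert` callable stands in for a real PDF-to-JPEG renderer, which the paper does not name.

```python
def preservation_process(aip: dict, convert) -> dict:
    """Hypothetical Preservation Processor step: take an Archival
    Information Package (AIP) holding a PDF, convert it to JPEG page
    images via the supplied `convert` callable, and build a derived
    AIP for re-ingest into the e-Depot."""
    jpegs = convert(aip["content"])
    return {
        "nbn": aip["nbn"],               # keep the unique NBN identifier
        "content": jpegs,                # derived JPEG page images
        "derived_from": aip["content"],  # provenance link to the PDF
        "format": "JPEG",
    }
```

Swapping in a different `convert` callable would yield the more generic version mentioned below: converting several different file formats to various other formats.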
If, for any reason, we should use conversion as a preservation strategy, the Preservation Processor can be adjusted into a more generic version, converting several different file formats to various other formats. Together, the Preservation Manager, the Preservation Processor, and the UVC as a first tool for permanent access form the components of our Preservation Subsystem.